Abstract

Distance weighted discrimination (DWD) is a margin-based classifier with an interesting
geometric motivation. DWD was originally proposed as a superior alternative
to the support vector machine (SVM); however, DWD has yet to become as popular
as the SVM. The main reasons are twofold. First, the state-of-the-art algorithm
for solving DWD is based on second-order-cone programming (SOCP), while the
SVM is a quadratic programming problem which is much more efficient to solve. Second,
the current statistical theory of DWD mainly focuses on the linear DWD for the
high-dimension-low-sample-size setting and data-piling, while the learning theory for 
the SVM mainly focuses on the Bayes risk consistency of the kernel SVM. In fact, 
the Bayes risk consistency of DWD is presented as an open problem in the original 
DWD paper. In this work, we advance the current understanding of DWD from both 
computational and theoretical perspectives. We propose a novel efficient algorithm for 
solving DWD, and our algorithm can be several hundred times faster than the existing 
state-of-the-art algorithm based on the SOCP. In addition, our algorithm can handle 
the generalized DWD, while the SOCP algorithm only works well for a special DWD 
but not the generalized DWD. Furthermore, we consider a natural kernel DWD in a 
reproducing kernel Hilbert space and then establish the Bayes risk consistency of the 
kernel DWD. We compare DWD and the SVM on several benchmark data sets and 
show that the two have comparable classification accuracy, but DWD equipped with 
our new algorithm can be much faster to compute than the SVM.

1 Introduction

Binary classification problems arise in diverse practical applications, such as financial
fraud detection, spam email classification, medical diagnosis with genomics data, and drug
response modeling, among many others. In these classification problems, the goal is to predict
class labels based on a given set of variables. Suppose that we observe a training data
set consisting of $n$ pairs, $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$. A classifier fits
a discriminant function $f$ and constructs a classification rule to classify data point $x_i$ to
either class $1$ or class $-1$ according to the sign of $f(x_i)$. The decision boundary is given by
$\{x : f(x) = 0\}$. Two canonical classifiers are linear discriminant analysis and logistic regression.
Modern classification algorithms can produce flexible non-linear decision boundaries
with high accuracy. The two most popular approaches are ensemble learning and support
vector machines/kernel machines. Ensemble learning methods such as boosting (Freund and Schapire,
1997) and random forests (Breiman, 2001) combine many weak learners like decision trees
into a powerful one. The support vector machine (SVM) (Vapnik, 1995, 1998) fits an optimal
separating hyperplane in the extended kernel feature space, which is non-linear in the original
covariate space. In a recent extensive numerical study by Fernandez-Delgado et al. (2014),
the kernel SVM is shown to be one of the best among 179 commonly used classifiers.

Motivated by “data-piling” in high-dimension-low-sample-size (HDLSS) problems, Marron et al.
(2007) invented a new classification algorithm named distance weighted discrimination (DWD)
that retains the elegant geometric interpretation of the SVM and delivers competitive
performance. Since then much work has been devoted to the development of DWD. Readers
are referred to Marron (2015) for an up-to-date list of work on DWD. On the other hand,
we notice that DWD has not attained the popularity it deserves. We can think of two
reasons for that. First, the current state-of-the-art algorithm for DWD is based on
second-order-cone programming (SOCP), as proposed in Marron et al. (2007). SOCP was an essential
part of the DWD development. As acknowledged in Marron et al. (2007), SOCP was then
much less well-known than quadratic programming, even in optimization. Furthermore,
SOCP is generally more computationally demanding than quadratic programming. There
are two existing implementations of the SOCP algorithm: Marron (2013) in Matlab and
Huang et al. (2012) in R. With these two implementations, we find that DWD is usually
more time-consuming than the SVM. Therefore, SOCP contributes to both the success and
unpopularity of DWD. Second, the kernel extension of DWD and the corresponding kernel
learning theory are under-developed compared to the kernel SVM. Although Marron et al.
(2007) proposed a version of non-linear DWD by mimicking the kernel trick used for deriving
the kernel SVM, theoretical justification of such a kernel DWD is still absent. In contrast,
the kernel SVM as well as the kernel logistic regression (Wahba et al., 1994; Zhu and
Hastie, 2005) have mature theoretical understandings built upon the theory of reproducing
kernel Hilbert space (RKHS) (Wahba, 1999; Hastie et al., 2009). Most learning theories of
DWD follow the geometric view of HDLSS data in Hall et al. (2005) and assume that $p \to \infty$
and $n$ is fixed, as opposed to the learning theory for the SVM where $n \to \infty$ and $p$ is fixed.
We are not against the fixed $n$ and $p \to \infty$ theory but it would be desirable to develop the
canonical learning theory for the kernel DWD when $p$ is fixed and $n \to \infty$. In fact, how to
establish the Bayes risk consistency of the DWD and kernel DWD was proposed as an open
research problem in the original DWD paper (Marron et al., 2007). Nearly a decade later,
the problem still remains open.

In this paper, we aim to resolve the aforementioned issues. We show that the kernel 
DWD in a RKHS has the Bayes risk consistency property if a universal kernel is used. This 
result should convince those who are less familiar with DWD to treat the kernel DWD as 
a serious competitor to the kernel SVM. To popularize the DWD, it is also important to 
allow practitioners to easily try DWD collectively with the SVM in real applications. To 
this end, we develop a novel fast algorithm to solve the linear and kernel DWD by using 
the majorization-minimization (MM) principle. Compared with the SOCP algorithm, our 
new algorithm has multiple advantages. First, our algorithm is much faster than the SOCP 
algorithm. In some examples, our algorithm can be several hundred times faster. Second, 
DWD equipped with our algorithm can be faster than the SVM. Third, our algorithm is 
easier to understand than the SOCP algorithm, especially for those who are not familiar with
semi-definite and second-order-cone programming. This could help demystify the DWD and
hence may increase its popularity.

To give a quick demonstration, we use a simulation example to compare the kernel DWD
and the kernel SVM. We drew 10 centers $\{\mu_{k+}\}$ from $N((1,0)^T, I)$. For each data point in
the positive class, we randomly picked a center $\mu_{k+}$ and then generated the point from
$N(\mu_{k+}, I/5)$. The negative class was assembled in the same way except that its 10 centers
$\{\mu_{k-}\}$ were drawn from $N((0,1)^T, I)$. For this model the Bayes rule is nonlinear.^1 Figure 1 displays
the training data from the simulation model where 100 observations are from the positive
class (plotted as triangles) and another 100 observations are from the negative class (plotted
as circles). We fitted the SVM and DWD using Gaussian kernels. We have implemented
our new algorithm for DWD in a publicly available R package kerndwd. We computed the
kernel SVM by using the R package kernlab (Karatzoglou et al., 2004). We recorded their
training errors and test errors. From Figure 1, we observe that like the kernel SVM, the
kernel DWD has a test error close to the Bayes error, which is consistent with the Bayes
risk consistency property of the kernel DWD established in section 4.2. Notably, the kernel
DWD is about three times as fast as the kernel SVM in this example.

[Figure 1 panels: SVM with Gaussian kernel; DWD with Gaussian kernel.]
Figure 1. Nonlinear SVM and DWD with Gaussian kernel. The broken curves are the Bayes
decision boundary. The R package kerndwd used 2.396 seconds to solve the kernel DWD, and
kernlab took 7.244 seconds to solve the kernel SVM. The timings include tuning parameters and
they are averaged over 100 runs.

The rest of the paper is organized as follows. To be self-contained, we first review the
SVM and DWD in section 2. We then derive the novel algorithm for DWD in section 3. We
introduce the kernel DWD in a reproducing kernel Hilbert space and establish the learning 
theory of kernel DWD in section 4. Real data examples are given in section 5 to compare 
DWD and the SVM. Technical proofs are provided in the appendix. 

^1 The Bayes decision boundary is a curve: $\{z : \sum_k \exp(-5\|z - \mu_{k+}\|^2/2) = \sum_k \exp(-5\|z - \mu_{k-}\|^2/2)\}$.

2 Review of SVMs and DWD


2.1 SVM 


The introduction of the SVM usually begins with its geometric interpretation as a maximum
margin classifier (Vapnik, 1995). Consider a case when two classes are separable by a
hyperplane $\{x : f(x) = \omega_0 + x^T\omega = 0\}$ such that $y_i(\omega_0 + x_i^T\omega)$ are all non-negative. Without
loss of generality, we assume that $\omega$ is a unit vector, i.e., $\omega^T\omega = 1$, and we observe that
each $d_i \equiv y_i(\omega_0 + x_i^T\omega)$ is equivalent to the Euclidean distance between the data point $x_i$
and the hyperplane. The reason is that $d_i = y_i(x_i - x_0)^T\omega$ and $\omega_0 + x_0^T\omega = 0$, where $x_0$ is
any point on the hyperplane and $\omega$ is the unit normal vector. The SVM classifier is
defined as the optimal separating hyperplane that maximizes the smallest distance of each
data point to the separating hyperplane. Mathematically, the SVM can be written as the
following optimization problem (for the separable data case):

$$\max_{\omega_0,\omega} \min_i d_i, \quad \text{subject to } d_i = y_i(\omega_0 + x_i^T\omega) \ge 0,\ \forall i, \text{ and } \omega^T\omega = 1. \tag{2.1}$$


The smallest distance $\min_i d_i$ is called the margin, and the SVM is thereby regarded as a
large-margin classifier. The data points closest to the hyperplane, i.e., those with $d_i = \min_j d_j$, are
dubbed the support vectors.

In general, the two classes are not separable, and thus $y_i(\omega_0 + x_i^T\omega)$ cannot be
non-negative for all $i = 1, \ldots, n$. To handle this issue, non-negative slack variables $\eta_i$, $1 \le i \le n$,
are introduced to ensure that all $y_i(\omega_0 + x_i^T\omega) + \eta_i$ are non-negative. With these slack variables,
the optimization problem (2.1) is generalized as follows:

$$\max_{\omega_0,\omega} \min_i d_i, \quad \text{subject to } d_i = y_i(\omega_0 + x_i^T\omega) + \eta_i \ge 0,\ \forall i,\quad \eta_i \ge 0,\ \sum_{i=1}^n \eta_i \le \text{constant}, \text{ and } \omega^T\omega = 1. \tag{2.2}$$


To compute SVMs, the optimization problem (2.2) is usually rephrased as an equivalent
quadratic programming (QP) problem,

$$\min_{\beta_0,\beta} \; \frac{1}{2}\beta^T\beta + c\sum_{i=1}^n \xi_i, \quad \text{subject to } y_i(\beta_0 + x_i^T\beta) \ge 1 - \xi_i,\ \xi_i \ge 0,\ \forall i, \tag{2.3}$$


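and it can be solved by maximizing its Lagrange dual function,

$$\max_{\mu} \left[\sum_{i=1}^n \mu_i - \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \mu_i\mu_{i'} y_i y_{i'} \langle x_i, x_{i'}\rangle\right], \quad \text{subject to } 0 \le \mu_i \le c \text{ and } \sum_{i=1}^n \mu_i y_i = 0. \tag{2.4}$$

By solving (2.4), one can show that the solution of (2.3) has the form

$$\hat\beta = \sum_{i=1}^n \hat\mu_i y_i x_i, \quad \text{and thus} \quad \hat f(x) = \hat\beta_0 + x^T\hat\beta = \hat\beta_0 + \sum_{i=1}^n \hat\mu_i y_i \langle x, x_i\rangle, \tag{2.5}$$

with $\hat\mu_i$ being non-zero only when $x_i$ is a support vector.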

One widely used method to extend the linear SVM to non-linear classifiers is the kernel
method (Aizerman et al., 1964), which replaces the dot product $\langle x_i, x_{i'}\rangle$ in the Lagrange
dual problem (2.4) with a kernel function $K(x_i, x_{i'})$, and hence the solution has the form

$$\hat f(x) = \hat\beta_0 + \sum_{i=1}^n \hat\mu_i y_i K(x, x_i).$$

Some popular examples of the kernel function $K$ include: $K(x, x') = \langle x, x'\rangle$ (linear kernel),
$K(x, x') = (a + \langle x, x'\rangle)^d$ (polynomial kernel), and $K(x, x') = \exp(-\sigma\|x - x'\|^2)$ (Gaussian
kernel), among others.
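
As a small illustration of the three kernels listed above, the following sketch evaluates them on plain Python lists. The function names and the default values of `a`, `d` and `sigma` are assumptions made for this example only; they are not taken from kerndwd or kernlab.

```python
import math

def dot(x, z):
    """Euclidean inner product <x, z> of two equal-length lists."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    """K(x, z) = <x, z>."""
    return dot(x, z)

def polynomial_kernel(x, z, a=1.0, d=2):
    """K(x, z) = (a + <x, z>)^d; a and d are illustrative defaults."""
    return (a + dot(x, z)) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-sigma * ||x - z||^2); sigma is an illustrative default."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sigma * sq_dist)

def kernel_matrix(points, kernel):
    """n x n matrix with (i, j)th entry kernel(points[i], points[j])."""
    return [[kernel(p, r) for r in points] for p in points]

if __name__ == "__main__":
    pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
    for row in kernel_matrix(pts, gaussian_kernel):
        print([round(v, 4) for v in row])
```

Any of these functions can be passed to `kernel_matrix` to build the matrix that replaces the inner products in the dual problem.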


2.2 DWD 

2.2.1 Motivation 

Distance weighted discrimination was originally proposed by Marron et al. (2007) to resolve
the data-piling issue. Marron et al. (2007) observed that many data points become support
vectors when the SVM is applied to the so-called high-dimension-low-sample-size (HDLSS)
data, and Marron et al. (2007) coined the term data-piling to describe this phenomenon.
We delineate it in Figure 2 through a simulation example. Let $\mu = (3, 0, \ldots, 0)$ be a
200-dimensional vector. We generated 50 points (indexed from 1 to 50 and represented as triangles)
from $N(-\mu, I_p)$ as the negative class and another 50 points (indexed from 51 to 100 and
represented as circles) from $N(\mu, I_p)$ as the positive class. We computed $\hat\beta_0$ and $\hat\beta$ for the SVM
(2.3). In the left panel of Figure 2, we plotted $\hat\beta_0 + x_i^T\hat\beta$ for each data point, and we portrayed
the support vectors by solid triangles and circles. We observe that 65 out of 100 data points
become support vectors. The right panel of Figure 2 corresponds to DWD (to be defined
shortly), where data-piling is attenuated. A real example revealing the data-piling can be
seen in Figure 1 of Ahn and Marron (2010).

Figure 2. A toy example illustrating the data-piling. Values $\hat\beta_0 + x_i^T\hat\beta$ are plotted for SVM and
DWD. Indices 1 to 50 represent negative class (triangles) and indices 51 to 100 are for positive
class (circles). In the left panel, data points belonging to the support vectors are depicted as solid
circles and triangles.

Marron et al. (2007) viewed “data-piling” as a drawback of the SVM, because the SVM
classifier (2.5) is a function of only support vectors. Another popular classifier, logistic
regression, does classification by using all the data points. However, the classical logistic
regression classifier is derived by following the maximum likelihood principle, not based on a
nice margin-maximization motivation.^2 Marron et al. (2007) wanted to have a new method
that is directly formulated by an SVM-like margin-maximization picture and also uses all
data points for classification. To this end, Marron et al. (2007) proposed DWD, which finds
a separating hyperplane minimizing the total inverse margins of all the data points:

$$\min_{\omega_0,\omega} \sum_{i=1}^n \left(\frac{1}{d_i} + c\,\eta_i\right), \quad \text{subject to } d_i = y_i(\omega_0 + x_i^T\omega) + \eta_i \ge 0,\ \eta_i \ge 0,\ \forall i, \text{ and } \omega^T\omega = 1. \tag{2.6}$$

^2 Zhu and Hastie (2005) later showed that the limiting $\ell_2$ penalized logistic regression approaches the
margin-maximizing hyperplane for the separable data case. DWD was first proposed in 2002.


There has been much work on variants of the standard DWD. We can only give an
incomplete list here. Qiao et al. (2010) introduced the weighted DWD to tackle unequal
costs or sample sizes by imposing different weights on the two classes. Huang et al. (2013)
extended the binary DWD to the multiclass case. Wang and Zou (2015) proposed the sparse
DWD for high-dimensional classification. In addition, the work connecting DWD with other
classifiers, e.g., SVM, includes but is not limited to LUM (Liu et al., 2011), DWSVM (Qiao
and Zhang, 2015a), and FLAME (Qiao and Zhang, 2015b). Marron (2015) provided a more
comprehensive review of the current DWD literature.


2.2.2 Computation 

Marron et al. (2007) solved the standard DWD by reformulating (2.6) as a second-order cone
programming (SOCP) problem (Alizadeh and Goldfarb, 2004; Boyd and Vandenberghe,
2004), which has a linear objective, linear constraints, and second-order-cone constraints.
Specifically, for each $i$, let $\rho_i = (1/d_i + d_i)/2$ and $\sigma_i = (1/d_i - d_i)/2$; then $\rho_i + \sigma_i = 1/d_i$,
$\rho_i - \sigma_i = d_i$, and $\rho_i^2 - \sigma_i^2 = 1$. Hence the original optimization problem (2.6) becomes

$$\min \; 1^T\rho + 1^T\sigma + c \cdot 1^T\eta, \quad \text{subject to } \rho - \sigma = YX\omega + \omega_0 \cdot y + \eta,\quad \eta_i \ge 0,\ (\rho_i; \sigma_i, 1) \in S_3,\ \forall i,\quad (1; \omega) \in S_{p+1}, \tag{2.7}$$

where $Y$ is an $n \times n$ diagonal matrix with the $i$th diagonal element $y_i$, $X$ is an $n \times p$ data
matrix with the $i$th row $x_i^T$, and $S_{m+1} = \{(\psi; \phi) \in \mathbb{R}^{m+1} : \psi \ge 0,\ \psi^2 \ge \phi^T\phi\}$ is the form of the
second-order cones. After solving $\hat\omega_0$ and $\hat\omega$ from (2.7), a new observation $x_{\text{new}}$ is classified
by $\mathrm{sign}(\hat\omega_0 + x_{\text{new}}^T\hat\omega)$.
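
The change of variables behind the SOCP reformulation can be checked numerically. The sketch below, with a hypothetical helper name `rho_sigma`, verifies the three identities $\rho_i + \sigma_i = 1/d_i$, $\rho_i - \sigma_i = d_i$ and $\rho_i^2 - \sigma_i^2 = 1$ for a few positive distances; it illustrates the algebra only and is not part of any SOCP solver.

```python
def rho_sigma(d):
    """Map a positive distance d to (rho, sigma) = ((1/d + d)/2, (1/d - d)/2)."""
    if d <= 0:
        raise ValueError("d must be positive")
    rho = (1.0 / d + d) / 2.0
    sigma = (1.0 / d - d) / 2.0
    return rho, sigma

if __name__ == "__main__":
    for d in (0.25, 1.0, 3.0):
        rho, sigma = rho_sigma(d)
        # rho + sigma recovers the inverse distance that DWD sums up,
        # rho - sigma recovers d, and rho^2 - sigma^2 = 1 places
        # (rho; sigma, 1) on the boundary of the cone S_3.
        print(d, rho + sigma, rho - sigma, rho ** 2 - sigma ** 2)
```

Because $\rho_i + \sigma_i = 1/d_i$, summing $\rho_i + \sigma_i$ over $i$ gives the total inverse distance, which is why the objective becomes linear after the substitution.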






2.2.3 Non-linear extension 


Note that the kernel SVM was derived by applying the kernel trick to the dual formulation
(2.4). Marron et al. (2007) followed the same approach to consider a version of kernel DWD
for achieving non-linear classification. The dual function of the problem (2.7) is (Marron
et al., 2007)

$$\max_{\alpha} \; -\sqrt{\alpha^T Y X X^T Y \alpha} + 2 \cdot 1^T\sqrt{\alpha}, \quad \text{subject to } y^T\alpha = 0,\ 0 \le \alpha \le c \cdot 1, \tag{2.8}$$

where $(\sqrt{\alpha})_i = \sqrt{\alpha_i}$, $i = 1, 2, \ldots, n$. Note that (2.8) only uses $XX^T$, which makes it easy
to employ the kernel trick to get a nonlinear extension of the linear DWD. For a given kernel
function $K$, define the kernel matrix as $(K)_{ij} = K(x_i, x_j)$, $1 \le i, j \le n$. Then a kernel
DWD can be defined as (Marron et al., 2007)

$$\max_{\alpha} \; -\sqrt{\alpha^T Y K Y \alpha} + 2 \cdot 1^T\sqrt{\alpha}, \quad \text{subject to } y^T\alpha = 0,\ 0 \le \alpha \le c \cdot 1. \tag{2.9}$$

To solve (2.9), Marron et al. (2007) used the Cholesky decomposition of the kernel matrix,
i.e., $K = \Phi\Phi^T$, and then replaced the predictors $X$ in (2.7) with $\Phi$. Marron et al. (2007)
also carefully discussed several algorithmic issues that ensure the equivalent optimality in
(2.7) and (2.8).

Remark 1. Two DWD implementations have been published thus far: a Matlab software
(Marron, 2013) and an R package DWD (Huang et al., 2012). Both implementations are based
on a Matlab SOCP solver SDPT3, which was developed by Tütüncü et al. (2003). We notice
that the R package DWD can only compute the linear DWD.

Remark 2. To the best of our knowledge, the theoretical justification for the kernel DWD in
Marron et al. (2007) is still unclear. The reason is likely that the nonlinear
extension is purely algorithmic. In fact, the Bayes risk consistency of DWD was proposed as
an open research problem in Marron et al. (2007). The kernel DWD considered in this paper
can be rigorously justified to have a universal Bayes risk consistency property; see details in
section 4.2.

2.2.4 Generalized DWD

Marron et al. (2007) also attempted to replace the reciprocal in the DWD optimization
problem (2.6) with the $q$th power ($q > 0$) of the inverse distances, and Hall et al. (2005) also
used it as the original definition of DWD. We name the DWD with this new formulation the
generalized DWD:

$$\min_{\omega_0,\omega} \sum_{i=1}^n \left(\frac{1}{d_i^q} + c\,\eta_i\right), \quad \text{subject to } d_i = y_i(\omega_0 + x_i^T\omega) + \eta_i \ge 0,\ \eta_i \ge 0,\ \forall i, \text{ and } \omega^T\omega = 1, \tag{2.10}$$

which reduces to the standard DWD (2.6) when $q = 1$.

The first asymptotic theory for DWD and generalized DWD was given in Hall et al.
(2005), who presented a novel geometric representation of the HDLSS data. Assume that
$X_1^+, X_2^+, \ldots, X_{n_+}^+$ are the data from the positive class and $X_1^-, X_2^-, \ldots, X_{n_-}^-$ are from
the negative class. Hall et al. (2005) stated that, when the sample size $n$ is fixed and the
dimension $p$ goes to infinity, under some regularity conditions, there exist two constants
$l^+$ and $l^-$ such that for each pair of $i$ and $j$,

$$p^{-1/2}\|X_i^+ - X_j^+\| \xrightarrow{P} \sqrt{2}\,l^+, \quad \text{and} \quad p^{-1/2}\|X_i^- - X_j^-\| \xrightarrow{P} \sqrt{2}\,l^-,$$

as $p \to \infty$. They applied this result to study several classifiers including the SVM
and the generalized DWD. For ease of presentation let us consider the equal subgroup size case,
i.e., $n_+ = n_- = n/2$. Hall et al. (2005) assumed that $p^{-1/2}\|EX^+ - EX^-\| \to \mu$ as $p \to \infty$.

The basic conclusion is that when $\mu$ is greater than a threshold that depends on $l^+$, $l^-$, and $n$,
the misclassification error converges to zero, and when $\mu$ is less than the same threshold, the
misclassification error converges to 50%. For more details, see Theorem 1 and Theorem 2 in
Hall et al. (2005). Ahn et al. (2007) further relaxed the assumptions thereof.

Remark 3. The generalized DWD has not been implemented yet because the SOCP
transformation only works for the standard DWD ($q = 1$) (2.7), but its extension to handle the
general cases is unclear if not impossible. That is why the current DWD literature only
focuses on DWD with $q = 1$. In fact, the generalized DWD with $q \neq 1$ was proposed as an
open research problem in Marron et al. (2007). The new algorithm proposed in this paper
can easily solve the generalized DWD problem for any $q > 0$; see section 3.

3 A Novel Algorithm for DWD


Marron et al. (2007) originally solved the standard DWD by transforming (2.6) into a SOCP
problem. This algorithm, however, cannot compute the generalized DWD (2.10) with $q \neq 1$.
In this section, we propose an entirely different algorithm based on the
majorization-minimization (MM) principle. Our new algorithm offers a unified solution to the standard
DWD and the generalized DWD.


3.1 Generalized DWD loss 

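Our algorithm begins with a loss + penalty formulation of the DWD. Lemma 1 gives the
result. Note that the loss function also lays the foundation of the kernel DWD learning
theory that will be discussed in section 4.

Lemma 1. The generalized DWD classifier in (2.10) can be written as $\mathrm{sign}(\hat\beta_0 + x^T\hat\beta)$, where
$(\hat\beta_0, \hat\beta)$ is computed from

$$\min_{\beta_0,\beta} C(\beta_0, \beta) \equiv \min_{\beta_0,\beta} \left[\frac{1}{n}\sum_{i=1}^n V_q\big(y_i(\beta_0 + x_i^T\beta)\big) + \lambda\beta^T\beta\right], \tag{3.1}$$

for some $\lambda$, where

$$V_q(u) = \begin{cases} 1 - u, & \text{if } u \le \dfrac{q}{q+1}, \\[2mm] \dfrac{1}{u^q}\,\dfrac{q^q}{(q+1)^{q+1}}, & \text{if } u > \dfrac{q}{q+1}. \end{cases} \tag{3.2}$$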


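Remark 4. The proof of Lemma 1 provides the one-to-one mapping between $\lambda$ in (3.1) and
$c$ in (2.10). Write $(\hat\beta(\lambda)_0, \hat\beta(\lambda))$ as the solution to (3.1). Define

$$c(\lambda) = \frac{(q+1)^{q+1}}{q^q}\,\|\hat\beta(\lambda)\|^{q+1}.$$

Considering (2.10) using $c(\lambda)$,

$$(\hat\omega_0(\lambda), \hat\omega(\lambda)) = \operatorname*{argmin}_{\omega_0,\omega} \sum_{i=1}^n \left(\frac{1}{d_i^q} + c(\lambda)\,\eta_i\right), \tag{3.3}$$

subject to $d_i = y_i(\omega_0 + x_i^T\omega) + \eta_i \ge 0$, $\eta_i \ge 0$, $\forall i$, and $\omega^T\omega = 1$,
we have

$$\hat\omega(\lambda) = \hat\beta(\lambda)/\|\hat\beta(\lambda)\| \quad \text{and} \quad \hat\omega_0(\lambda) = \hat\beta(\lambda)_0/\|\hat\beta(\lambda)\|.$$

Note that $\mathrm{sign}(\hat\omega_0(\lambda) + x^T\hat\omega(\lambda)) = \mathrm{sign}(\hat\beta(\lambda)_0 + x^T\hat\beta(\lambda))$, which means that the generalized DWD
classifier defined by (3.3) is equivalent to the generalized DWD classifier defined by (3.1).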

By Lemma 1, we call $V_q(\cdot)$ the generalized DWD loss. It can be visualized in Figure 3.
We observe that the generalized DWD loss decreases as $q$ increases and it approaches the
SVM hinge loss function as $q \to \infty$. When $q = 1$, the generalized DWD loss becomes

$$V_1(u) = \begin{cases} 1 - u, & \text{if } u \le 1/2, \\ 1/(4u), & \text{if } u > 1/2. \end{cases}$$

We notice that $V_1(u)$ has appeared in the literature (Qiao et al., 2010; Liu et al., 2011). In
this work we give a unified treatment of all $q$ values, not just $q = 1$.
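
To make the shape of the loss concrete, here is a minimal sketch of the generalized DWD loss as a Python function, together with the hinge loss for comparison. The function names are illustrative, and evaluating the inverse-power branch on the log scale is a numerical choice made for this example (to avoid overflow for large $q$), not something prescribed by the paper.

```python
import math

def dwd_loss(u, q=1.0):
    """Generalized DWD loss V_q(u), for q > 0."""
    threshold = q / (q + 1.0)
    if u <= threshold:
        return 1.0 - u  # linear piece, identical to the hinge loss here
    # Inverse-power piece q^q / (q+1)^(q+1) / u^q, computed via logs so
    # that large q does not overflow a float.
    log_value = q * math.log(q) - (q + 1.0) * math.log(q + 1.0) - q * math.log(u)
    return math.exp(log_value)

def hinge_loss(u):
    """SVM hinge loss [1 - u]_+."""
    return max(0.0, 1.0 - u)

if __name__ == "__main__":
    grid = (-1.0, 0.5, 1.0, 2.0)
    for q in (0.5, 1.0, 4.0, 8.0):
        print("q =", q, [round(dwd_loss(u, q), 4) for u in grid])
    print("hinge ", [round(hinge_loss(u), 4) for u in grid])
```

For $q = 1$ the second branch reduces to $1/(4u)$, and for large $q$ the values approach those of the hinge loss, in line with the description above.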



Figure 3. Top to bottom are the DWD loss functions with q = 0.5,1,4,8, and the SVM hinge loss. 


3.2 Derivation of the algorithm 

We now show how to develop the new algorithm by using the MM principle (De Leeuw
and Heiser, 1977; Lange et al., 2000; Hunter and Lange, 2004). Some recent successful
applications of the MM principle can be seen in Hunter and Li (2005); Wu and Lange
(2008); Zou and Li (2008); Zhou and Lange (2010); Yang and Zou (2013); Lange and Zhou
(2014), among others. The main idea of the MM principle is easy to understand. Suppose
$\theta = (\beta_0, \beta^T)^T$ and we aim to minimize $C(\theta)$, defined in (3.1). The MM principle finds a
majorization function $D(\theta \mid \theta_k)$ satisfying $C(\theta) < D(\theta \mid \theta_k)$ for any $\theta \neq \theta_k$ and $C(\theta_k) =
D(\theta_k \mid \theta_k)$, and then we generate a sequence $\{C(\theta_k)\}_{k=1}^{\infty}$ by updating $\theta_k$ via $\theta_k \to \theta_{k+1} =
\operatorname{argmin}_{\theta} D(\theta \mid \theta_k)$. The sequence is non-increasing, because
$C(\theta_{k+1}) \le D(\theta_{k+1} \mid \theta_k) \le D(\theta_k \mid \theta_k) = C(\theta_k)$.

We first present some properties of the generalized DWD loss functions, which give rise
to a quadratic majorization function of $C(\theta)$. The generalized DWD loss is differentiable
everywhere; its first-order derivative is given below.

$$V_q'(u) = \begin{cases} -1, & \text{if } u \le \dfrac{q}{q+1}, \\[2mm] -\dfrac{1}{u^{q+1}}\left(\dfrac{q}{q+1}\right)^{q+1}, & \text{if } u > \dfrac{q}{q+1}. \end{cases} \tag{3.4}$$


Lemma 2. The generalized DWD loss function $V_q(\cdot)$ has a Lipschitz continuous gradient,

$$|V_q'(t) - V_q'(\tilde t)| < M|t - \tilde t|, \tag{3.5}$$

which further implies a quadratic majorization function of $V_q(\cdot)$ such that

$$V_q(t) < V_q(\tilde t) + V_q'(\tilde t)(t - \tilde t) + \frac{M}{2}(t - \tilde t)^2 \tag{3.6}$$

for any $t \neq \tilde t$ and $M = (q+1)^2/q$.
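
The quadratic majorization in Lemma 2 can be sanity-checked numerically. The sketch below re-implements the loss and its derivative under illustrative names and reports the smallest gap between the quadratic upper bound and the loss over a grid; this is a numerical check under the stated formulas, not a proof.

```python
import math

def dwd_loss(u, q):
    """Generalized DWD loss V_q(u)."""
    thr = q / (q + 1.0)
    if u <= thr:
        return 1.0 - u
    return math.exp(q * math.log(q) - (q + 1.0) * math.log(q + 1.0) - q * math.log(u))

def dwd_loss_deriv(u, q):
    """Derivative V_q'(u): -1 on the linear piece, -(thr/u)^(q+1) beyond it."""
    thr = q / (q + 1.0)
    if u <= thr:
        return -1.0
    return -((thr / u) ** (q + 1.0))

def quadratic_majorizer(t, t_tilde, q):
    """Right-hand side of the quadratic bound, expanded around t_tilde."""
    big_m = (q + 1.0) ** 2 / q
    return (dwd_loss(t_tilde, q)
            + dwd_loss_deriv(t_tilde, q) * (t - t_tilde)
            + 0.5 * big_m * (t - t_tilde) ** 2)

def majorization_gap(q, grid):
    """Smallest value of (majorizer - loss) over all pairs (t, t_tilde) in grid."""
    return min(quadratic_majorizer(t, s, q) - dwd_loss(t, q)
               for t in grid for s in grid)

if __name__ == "__main__":
    grid = [k / 10.0 for k in range(-30, 31)]
    for q in (0.5, 1.0, 4.0):
        print("q =", q, "smallest gap:", majorization_gap(q, grid))
```

The gap is zero only when $t = \tilde t$, which is the tangency condition the MM principle requires.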

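Denote the current solution by $\tilde\theta = (\tilde\beta_0, \tilde\beta^T)^T$ and the updated solution by $\theta = (\beta_0, \beta^T)^T$.
With a slight abuse of notation, we write $C(\theta) \equiv C(\beta_0, \beta)$ and $D(\theta \mid \tilde\theta) \equiv D(\beta_0, \beta)$. We have
that for any $(\beta_0, \beta) \neq (\tilde\beta_0, \tilde\beta)$,

$$
\begin{aligned}
C(\beta_0, \beta) &\equiv \frac{1}{n}\sum_{i=1}^n V_q\big(y_i(\beta_0 + x_i^T\beta)\big) + \lambda\beta^T\beta \\
&< \frac{1}{n}\sum_{i=1}^n V_q\big(y_i(\tilde\beta_0 + x_i^T\tilde\beta)\big) + \frac{1}{n}\sum_{i=1}^n V_q'\big(y_i(\tilde\beta_0 + x_i^T\tilde\beta)\big)\big[y_i(\beta_0 - \tilde\beta_0) + y_i x_i^T(\beta - \tilde\beta)\big] \\
&\quad + \frac{M}{2n}\sum_{i=1}^n \big[y_i(\beta_0 - \tilde\beta_0) + y_i x_i^T(\beta - \tilde\beta)\big]^2 + \lambda\beta^T\beta \\
&\equiv D(\beta_0, \beta).
\end{aligned} \tag{3.7}
$$

We now find the minimizer of $D(\beta_0, \beta)$. The gradients of $D(\beta_0, \beta)$ are given as follows:

$$
\begin{aligned}
\frac{\partial D(\beta_0, \beta)}{\partial \beta} &= \frac{1}{n}\sum_{i=1}^n V_q'\big(y_i(\tilde\beta_0 + x_i^T\tilde\beta)\big) y_i x_i + \frac{M}{n}\sum_{i=1}^n \big[(\beta_0 - \tilde\beta_0) + x_i^T(\beta - \tilde\beta)\big] x_i + 2\lambda\beta \\
&= X^T z + \frac{M}{n}(\beta_0 - \tilde\beta_0) X^T 1 + \frac{M}{n}\sum_{i=1}^n x_i x_i^T(\beta - \tilde\beta) + 2\lambda\beta \\
&= X^T z + \frac{M}{n}(\beta_0 - \tilde\beta_0) X^T 1 + \left(\frac{M}{n} X^T X + 2\lambda I_p\right)(\beta - \tilde\beta) + 2\lambda\tilde\beta,
\end{aligned} \tag{3.8}
$$

$$
\begin{aligned}
\frac{\partial D(\beta_0, \beta)}{\partial \beta_0} &= \frac{1}{n}\sum_{i=1}^n V_q'\big(y_i(\tilde\beta_0 + x_i^T\tilde\beta)\big) y_i + \frac{M}{n}\sum_{i=1}^n \big[(\beta_0 - \tilde\beta_0) + x_i^T(\beta - \tilde\beta)\big] \\
&= 1^T z + M(\beta_0 - \tilde\beta_0) + \frac{M}{n} 1^T X(\beta - \tilde\beta),
\end{aligned} \tag{3.9}
$$

where $X$ is the $n \times p$ data matrix with the $i$th row $x_i^T$, $z$ is an $n \times 1$ vector with the $i$th element
$y_i V_q'(y_i(\tilde\beta_0 + x_i^T\tilde\beta))/n$, and $1 \in \mathbb{R}^n$ is the vector of ones. Setting $[\partial D(\beta_0, \beta)/\partial\beta_0, \partial D(\beta_0, \beta)/\partial\beta]$
to be zeros, we obtain the minimizer of $D(\beta_0, \beta)$:

$$
\begin{pmatrix} \beta_0 \\ \beta \end{pmatrix} = \begin{pmatrix} \tilde\beta_0 \\ \tilde\beta \end{pmatrix} - \frac{n}{M}\begin{pmatrix} n & 1^T X \\ X^T 1 & X^T X + \frac{2n\lambda}{M} I_p \end{pmatrix}^{-1}\begin{pmatrix} 1^T z \\ X^T z + 2\lambda\tilde\beta \end{pmatrix}.
$$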

So far we have completed all the steps of the MM algorithm. Details are summarized in 
Algorithm 1. 

We have implemented Algorithm 1 in an R package kerndwd, which is publicly available 
for download on CRAN. 
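
The MM update derived above can be sketched in a few dozen lines of plain Python. This is an illustrative re-implementation under stated assumptions, not the kerndwd code: the function names, the convergence tolerance, the zero initialization and the toy data are all made up for the example, and for simplicity the linear system is solved afresh at every iteration instead of inverting the matrix once per $\lambda$.

```python
import math

def dwd_loss(u, q):
    """Generalized DWD loss V_q(u)."""
    thr = q / (q + 1.0)
    if u <= thr:
        return 1.0 - u
    return math.exp(q * math.log(q) - (q + 1.0) * math.log(q + 1.0) - q * math.log(u))

def dwd_loss_deriv(u, q):
    """Derivative V_q'(u)."""
    thr = q / (q + 1.0)
    return -1.0 if u <= thr else -((thr / u) ** (q + 1.0))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def objective(x_rows, y, beta0, beta, lam, q):
    """Penalized empirical DWD loss C(beta0, beta)."""
    n = len(y)
    total = 0.0
    for i in range(n):
        margin = y[i] * (beta0 + sum(a * b for a, b in zip(x_rows[i], beta)))
        total += dwd_loss(margin, q)
    return total / n + lam * sum(b * b for b in beta)

def fit_linear_dwd(x_rows, y, lam, q=1.0, tol=1e-8, max_iter=10000):
    """MM iterations for the linear generalized DWD at a single lambda."""
    n, p = len(x_rows), len(x_rows[0])
    big_m = (q + 1.0) ** 2 / q
    col_sums = [sum(row[j] for row in x_rows) for j in range(p)]
    # P = [[n, 1'X], [X'1, X'X + (2 n lam / M) I_p]]
    p_mat = [[float(n)] + col_sums]
    for j in range(p):
        gram = [sum(row[j] * row[k] for row in x_rows) for k in range(p)]
        gram[j] += 2.0 * n * lam / big_m
        p_mat.append([col_sums[j]] + gram)
    beta0, beta = 0.0, [0.0] * p
    for _ in range(max_iter):
        margins = [y[i] * (beta0 + sum(a * b for a, b in zip(x_rows[i], beta)))
                   for i in range(n)]
        z = [y[i] * dwd_loss_deriv(margins[i], q) / n for i in range(n)]
        rhs = [sum(z)] + [sum(x_rows[i][j] * z[i] for i in range(n)) + 2.0 * lam * beta[j]
                          for j in range(p)]
        step = solve(p_mat, rhs)
        new0 = beta0 - (n / big_m) * step[0]
        new = [beta[j] - (n / big_m) * step[j + 1] for j in range(p)]
        change = max([abs(new0 - beta0)] + [abs(a - b) for a, b in zip(new, beta)])
        beta0, beta = new0, new
        if change < tol:
            break
    return beta0, beta

# Tiny made-up data set (not from the paper).
X_TOY = [[2.0, 0.0], [1.5, 0.5], [2.5, -0.3], [-2.0, 0.1], [-1.8, -0.4], [-2.2, 0.3]]
Y_TOY = [1, 1, 1, -1, -1, -1]
B0_HAT, B_HAT = fit_linear_dwd(X_TOY, Y_TOY, lam=0.1, q=1.0)
print(B0_HAT, B_HAT, objective(X_TOY, Y_TOY, B0_HAT, B_HAT, 0.1, 1.0))
```

Because the matrix depends only on the data, $\lambda$ and $q$, a practical implementation would factor or invert it once per $\lambda$ and reuse it across iterations, so that each MM step costs only matrix-vector products.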


3.3 Performance of the new algorithm 

In this section, we show the superior computational performance of our R implementation,
kerndwd, over the two existing implementations, the R package DWD (Huang et al., 2012) and
the Matlab software (Marron, 2013). To avoid confusion, we henceforth use OURS, HUANG,
and MARRON to denote kerndwd, DWD, and the Matlab implementation, respectively. Since
HUANG cannot handle non-linear kernels or the generalized DWD with $q \neq 1$, we consider only
the linear DWD with $q$ fixed at one. All experiments were conducted on an Intel Core
i5 M560 (2.67 GHz) processor.

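Algorithm 1 Linear generalized DWD

1: Initialize $(\tilde\beta_0, \tilde\beta)$
2: for each $\lambda$ do
3:   Compute $P^{-1}(\lambda)$:
     $$P^{-1}(\lambda) = \begin{pmatrix} n & 1^T X \\ X^T 1 & X^T X + \frac{2n\lambda}{M} I_p \end{pmatrix}^{-1}$$
4:   repeat
5:     Compute $z = (z_1, \ldots, z_n)^T$: $z_i = y_i V_q'(y_i(\tilde\beta_0 + x_i^T\tilde\beta))/n$
6:     Compute:
       $$\begin{pmatrix} \beta_0 \\ \beta \end{pmatrix} \leftarrow \begin{pmatrix} \tilde\beta_0 \\ \tilde\beta \end{pmatrix} - \frac{nq}{(q+1)^2} P^{-1}(\lambda) \begin{pmatrix} 1^T z \\ X^T z + 2\lambda\tilde\beta \end{pmatrix}$$
7:     Set $(\tilde\beta_0, \tilde\beta) = (\beta_0, \beta)$
8:   until the convergence condition is met
9: end for

For a fair comparison, we study the four numerical examples used in Marron et al. (2007),
except for different sample sizes and dimensions. In each example, we generate a data set
with sample size n = 500 and dimension p = 50. The responses are always binary; one half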
of the data have responses +1 and the other half have —1. Data in example 1 are generated 
from Gaussian distribution with means of (±2.2, 0,..., 0) and an identity covariance for ±1 
classes respectively. Example 2 has 80% of data drawn as example 1 whereas the other 20% 
from Gaussian distributions with means of (±100, ±500,0,..., 0) for ±1 classes. In example 
3, 80% of the data are obtained as example 1 as well, while the means of the remaining 20% 
have the first coordinate replaced by ±0.1 and one randomly chosen coordinate replaced by 
±100 for ±1 classes. For example 4, at the first 25 coordinates, the data from —1 class are 
standard Gaussian and the data from +1 class are 11.09 times standard Gaussian; for both 
classes, the last 25 coordinates are just the squares of the first 25. 

In each example, we fitted a linear DWD with five different tuning parameter values
$\lambda = (0.01, 0.1, 1, 10, 100)$. After obtaining $(\hat\beta_0, \hat\beta)$, we computed $(\hat\omega_0, \hat\omega)$ and the constant $c$
in (2.7) by using Remark 4. We then used HUANG and MARRON to compute their solutions.
Note that in theory all three implementations should yield identical $(\hat\omega_0, \hat\omega)$. From Table 1
we observe that OURS took remarkably less computation time than HUANG and MARRON. In
example 1, for instance, OURS spent only 0.012 second on average to fit a DWD model, while
HUANG used 14.525 seconds, and MARRON took 2.204 seconds, which were 1210 and 183 times
larger, respectively. In all four examples, OURS was more than 700 times faster
than the existing R implementation HUANG, and also more than 70 times faster than the
Matlab implementation MARRON.^3

Table 1. Timing comparisons among the R package kerndwd (denoted as OURS), the R package
DWD (denoted as HUANG), and the Matlab implementation (denoted as MARRON). All the timings are
averaged over 100 independent replicates.

            Timing (in sec.)               Ratio
Example   OURS     HUANG     MARRON    t(HUANG)/t(OURS)   t(MARRON)/t(OURS)
1         0.012    14.525    2.204     1210.8             183.7
2         0.024    18.018    2.411      750.8             100.5
3         0.028    26.918    2.076      961.4              74.1
4         0.020    21.536    2.264     1076.8             113.2



4 Kernel DWD in RKHS and Bayes Risk Consistency 
4.1 Kernel DWD in RKHS 

The kernel SVM can be derived by using the kernel trick or using the view of non-parametric
function estimation in a reproducing kernel Hilbert space (RKHS). Much of the theoretical
work on the kernel SVM is based on the RKHS formulation of SVMs. The derivation of the
kernel SVM in an RKHS is given in Hastie et al. (2009). We take a similar approach to derive
the kernel DWD, as our goal is to establish the kernel learning theory for DWD.

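Consider $\mathcal{H}_K$, a reproducing kernel Hilbert space generated by the kernel function $K$.
Mercer's theorem ensures that $K$ has an eigen-expansion $K(x, x') = \sum_{t=1}^{\infty} \gamma_t \phi_t(x)\phi_t(x')$,
with $\gamma_t \ge 0$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$. Then the Hilbert space $\mathcal{H}_K$ is defined as the collection of
functions $h(x) = \sum_{t=1}^{\infty} \theta_t\phi_t(x)$ for any $\theta_t$ such that $\sum_{t=1}^{\infty} \theta_t^2/\gamma_t < \infty$, and the inner product
is $\langle \sum_{t=1}^{\infty} \theta_t\phi_t, \sum_{t=1}^{\infty} \delta_t\phi_t \rangle_{\mathcal{H}_K} = \sum_{t=1}^{\infty} \theta_t\delta_t/\gamma_t$.
Given $\mathcal{H}_K$, let the non-linear DWD be written as $\mathrm{sign}(\hat\beta_0 + \hat h(x))$, where $(\hat\beta_0, \hat h)$ is the
solution of

$$\min_{\beta_0,\, h \in \mathcal{H}_K} \left[\frac{1}{n}\sum_{i=1}^n V_q\big(y_i(\beta_0 + h(x_i))\big) + \lambda\|h\|_{\mathcal{H}_K}^2\right], \tag{4.1}$$

where $V_q(\cdot)$ is the generalized DWD loss (3.2). The representer theorem concludes that the
solution of (4.1) has a finite expansion based on $K(x, x_i)$ (Wahba, 1990),

$$\hat h(x) = \sum_{i=1}^n \hat\alpha_i K(x, x_i),$$

and thus

$$\|\hat h\|_{\mathcal{H}_K}^2 = \sum_{i=1}^n\sum_{j=1}^n \hat\alpha_i\hat\alpha_j K(x_i, x_j).$$

Consequently, (4.1) can be paraphrased with matrix notation,

$$\min_{\beta_0,\alpha} C_K(\beta_0, \alpha) \equiv \min_{\beta_0,\alpha} \left[\frac{1}{n}\sum_{i=1}^n V_q\big(y_i(\beta_0 + K_i^T\alpha)\big) + \lambda\alpha^T K\alpha\right], \tag{4.2}$$

where $K$ is the kernel matrix with the $(i, j)$th element $K(x_i, x_j)$ and $K_i$ is the $i$th column
of $K$.

^3 We also checked the quality of the computed solutions by these different algorithms. In theory they
should be identical. In practice, due to machine errors and implementations, they could be different. We
found that in all examples our new algorithm gave better solutions in the sense that the objective function
in (2.7) has the smallest value. HUANG and MARRON gave similar but slightly larger objective function values.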

Remark 5. We can compare (4.2) to the kernel SVM (Hastie et al., 2009)

$$\min_{\beta_0,\alpha} \left[\frac{1}{n}\sum_{i=1}^n \big[1 - y_i(\beta_0 + K_i^T\alpha)\big]_+ + \lambda\alpha^T K\alpha\right], \tag{4.3}$$

where $[1 - t]_+$ is the hinge loss underlying the SVM. As shown in Figure 3, the generalized
DWD loss takes the hinge loss as its limit when $q \to \infty$. In general, the generalized DWD
loss and the hinge loss look very similar, which suggests that the kernel DWD and the kernel
SVM equipped with the same kernel have similar statistical behavior.

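The procedure for deriving Algorithm 1 for the linear DWD can be directly adopted
to derive an efficient algorithm for solving the kernel DWD. We obtain the majorization
function $D_K(\beta_0, \alpha)$,

$$
\begin{aligned}
D_K(\beta_0, \alpha) &= \frac{1}{n}\sum_{i=1}^n V_q\big(y_i(\tilde\beta_0 + K_i^T\tilde\alpha)\big) + \frac{1}{n}\sum_{i=1}^n V_q'\big(y_i(\tilde\beta_0 + K_i^T\tilde\alpha)\big)\big[y_i(\beta_0 - \tilde\beta_0) + y_i K_i^T(\alpha - \tilde\alpha)\big] \\
&\quad + \frac{M}{2n}\sum_{i=1}^n \big[y_i(\beta_0 - \tilde\beta_0) + y_i K_i^T(\alpha - \tilde\alpha)\big]^2 + \lambda\alpha^T K\alpha,
\end{aligned}
$$

and then find the minimizer of $D_K(\beta_0, \alpha)$, which has a closed-form expression. We opt to
omit the details here for space consideration. Algorithm 2 summarizes the entire algorithm
for the kernel DWD.

Algorithm 2 Kernel DWD

1: Initialize $(\tilde\beta_0, \tilde\alpha)$
2: for each $\lambda$ do
3:   Compute $P^{-1}(\lambda)$:
     $$P^{-1}(\lambda) = \begin{pmatrix} n & 1^T K \\ K 1 & KK + \frac{2nq\lambda}{(q+1)^2} K \end{pmatrix}^{-1}$$
4:   repeat
5:     Compute $z = (z_1, \ldots, z_n)^T$: $z_i = y_i V_q'(y_i(\tilde\beta_0 + K_i^T\tilde\alpha))/n$
6:     Compute:
       $$\begin{pmatrix} \beta_0 \\ \alpha \end{pmatrix} \leftarrow \begin{pmatrix} \tilde\beta_0 \\ \tilde\alpha \end{pmatrix} - \frac{nq}{(q+1)^2} P^{-1}(\lambda) \begin{pmatrix} 1^T z \\ Kz + 2\lambda K\tilde\alpha \end{pmatrix}$$
7:     Set $(\tilde\beta_0, \tilde\alpha) = (\beta_0, \alpha)$
8:   until the convergence condition is met
9: end for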
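
The kernel version of the MM update can be sketched along the same lines as the linear case. The code below is a hedged illustration, not the kerndwd implementation: the function names, tolerance, zero initialization, Gaussian kernel parameter and the XOR-style toy data are all assumptions for this example, and the linear system is solved at each iteration instead of inverting the matrix once per $\lambda$.

```python
import math

def dwd_loss_deriv(u, q):
    """Derivative of the generalized DWD loss."""
    thr = q / (q + 1.0)
    return -1.0 if u <= thr else -((thr / u) ** (q + 1.0))

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-sigma * ||x - z||^2); sigma is an illustrative default."""
    return math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(x, z)))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_kernel_dwd(k_mat, y, lam, q=1.0, tol=1e-8, max_iter=10000):
    """MM iterations for the kernel DWD at a single lambda (K symmetric)."""
    n = len(y)
    big_m = (q + 1.0) ** 2 / q
    ridge = 2.0 * n * lam / big_m
    k_one = [sum(row) for row in k_mat]  # the vector K 1
    # P = [[n, 1'K], [K1, KK + ridge * K]]
    p_mat = [[float(n)] + k_one]
    for j in range(n):
        kk = [sum(k_mat[j][i] * k_mat[i][k] for i in range(n)) + ridge * k_mat[j][k]
              for k in range(n)]
        p_mat.append([k_one[j]] + kk)
    beta0, alpha = 0.0, [0.0] * n
    for _ in range(max_iter):
        k_alpha = [sum(k_mat[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        z = [y[i] * dwd_loss_deriv(y[i] * (beta0 + k_alpha[i]), q) / n for i in range(n)]
        rhs = [sum(z)] + [sum(k_mat[j][i] * z[i] for i in range(n)) + 2.0 * lam * k_alpha[j]
                          for j in range(n)]
        step = solve(p_mat, rhs)
        new0 = beta0 - (n / big_m) * step[0]
        new = [alpha[j] - (n / big_m) * step[j + 1] for j in range(n)]
        change = max([abs(new0 - beta0)] + [abs(a - b) for a, b in zip(new, alpha)])
        beta0, alpha = new0, new
        if change < tol:
            break
    return beta0, alpha

def decision_value(x_new, points, beta0, alpha, sigma=1.0):
    """beta0 + sum_i alpha_i K(x_new, x_i)."""
    return beta0 + sum(a * gaussian_kernel(x_new, pt, sigma) for a, pt in zip(alpha, points))

# XOR-style toy data (made up): not linearly separable.
POINTS = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
LABELS = [1, 1, -1, -1]
K_TOY = [[gaussian_kernel(a, b) for b in POINTS] for a in POINTS]
BETA0_K, ALPHA_K = fit_kernel_dwd(K_TOY, LABELS, lam=0.01)
print([decision_value(pt, POINTS, BETA0_K, ALPHA_K) for pt in POINTS])
```

The toy data cannot be separated by any linear rule, so a correct sign pattern on the four points illustrates what the kernel extension adds over the linear DWD.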


4.2 Kernel learning theory 

Lin (2002) formulated the kernel SVM as a non-parametric function estimation problem in 
a reproducing kernel Hilbert space and showed that the population minimizer of the SVM 
loss function is the Bayes rule, indicating that the SVM directly approximates the optimal 
Bayes classifier. Lin (2004) further coined a name “Fisher consistency” to describe such a 
result. The Vapnik-Chervonenkis (VC) analysis (Vapnik, 1998; Anthony and Bartlett, 1999) 
and the margin analysis (Bartlett and Shawe-Taylor, 1999; Shawe-Taylor and Cristianini, 
2000) have been used to bound the expected classification error of the SVM. Zhang (2004) 
used the so-called leave-one-out analysis (Jaakkola and Haussler, 1999) to study a class of 
kernel machines. The existing theoretical work on the kernel SVM provides us a nice road
map to study the kernel DWD. In this section we first elucidate the Fisher consistency (Lin,
2004) of the generalized kernel DWD, and we then establish the Bayes risk consistency of
the kernel DWD when a universal kernel is employed.

Let $\eta(x)$ denote the conditional probability $P(Y = 1 \mid X = x)$. Under the 0-1 loss, the
theoretical optimal Bayes rule is $f^*(x) = \mathrm{sign}(\eta(x) - 1/2)$. Assume $\eta(x)$ is a measurable
function and $P(\eta(X) = 1/2) = 0$ throughout.


Lemma 3. The population minimizer of the expected generalized DWD loss $E_{XY}[V_q(Y f(X))]$ is

$$\tilde f(x) = \frac{q}{q+1}\left[ \left(\frac{\eta(x)}{1-\eta(x)}\right)^{\frac{1}{q+1}} \cdot I(\eta(x) > 1/2) - \left(\frac{1-\eta(x)}{\eta(x)}\right)^{\frac{1}{q+1}} \cdot I(\eta(x) < 1/2) \right], \tag{4.4}$$

where $I(\cdot)$ is the indicator function. The population minimizer $\tilde f(x)$ has the same sign as $\eta(x) - 1/2$.
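The closed form in Lemma 3 can be compared against a brute-force minimization of the conditional risk $\eta V_q(f) + (1-\eta) V_q(-f)$. The sketch below is only a numerical sanity check under the assumption that $V_q(u) = 1 - u$ for $u \le q/(q+1)$ and $q^q/((q+1)^{q+1}u^q)$ otherwise; the grid range, step size and the chosen values of $\eta$ and $q$ are arbitrary.

```python
def dwd_loss(u, q):
    """Generalized DWD loss V_q(u)."""
    cut = q / (q + 1.0)
    return 1.0 - u if u <= cut else (cut / u) ** q / (q + 1.0)


def conditional_risk(f, eta, q):
    """eta * V_q(f) + (1 - eta) * V_q(-f) for a scalar decision value f."""
    return eta * dwd_loss(f, q) + (1.0 - eta) * dwd_loss(-f, q)


def population_minimizer(eta, q):
    """Closed-form minimizer of the conditional risk for eta != 1/2."""
    if not 0.0 < eta < 1.0 or eta == 0.5:
        raise ValueError("eta must lie in (0, 1) and differ from 1/2")
    cut = q / (q + 1.0)
    if eta > 0.5:
        return cut * (eta / (1.0 - eta)) ** (1.0 / (q + 1.0))
    return -cut * ((1.0 - eta) / eta) ** (1.0 / (q + 1.0))


def grid_minimizer(eta, q, lo=-5.0, hi=5.0, steps=20001):
    """Brute-force minimizer of the conditional risk on an even grid."""
    best_f, best_val = lo, conditional_risk(lo, eta, q)
    for k in range(1, steps):
        f = lo + (hi - lo) * k / (steps - 1)
        val = conditional_risk(f, eta, q)
        if val < best_val:
            best_f, best_val = f, val
    return best_f


checks = []
for eta in (0.1, 0.3, 0.7, 0.9):
    for q in (0.5, 1.0, 4.0):
        checks.append((eta, q, population_minimizer(eta, q), grid_minimizer(eta, q)))

for eta, q, closed, brute in checks:
    print(f"eta = {eta:.1f}, q = {q:.1f}: closed form {closed:+.4f}, grid {brute:+.4f}")
```

In every case the closed-form value carries the sign of $\eta - 1/2$, which is the Fisher consistency property the lemma states.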


Fisher consistency is a property of the loss function. The interpretation is that the
generalized DWD can approach the Bayes rule with infinitely many samples. We notice that
Fisher consistency of $V_1(u)$ has been shown before (Qiao et al., 2010; Liu et al., 2011). In
reality all classifiers are estimated from a finite sample. Thus, a more refined analysis of the
actual DWD classifier is needed, and that is what we achieve in the following.

Following the convention in the literature, we absorb the intercept into $h$ and present the
kernel DWD as follows:

$$\hat f_n = \operatorname*{argmin}_{f \in \mathcal{H}_K} \left[ \frac{1}{n}\sum_{i=1}^{n} V_q\left(y_i f(x_i)\right) + \lambda_n \|f\|_{\mathcal{H}_K}^2 \right]. \tag{4.5}$$

The ultimate goal is to show that the misclassification error of the kernel DWD approaches
the Bayes error rate, so that the kernel DWD classifier works as well as the Bayes
rule asymptotically. Following Zhang (2004), we derive the following lemma.

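Lemma 4. For a discrimination function $f$, we define $R(f) = E_{XY}[I(Y \ne \mathrm{sign}(f(X)))]$.
Assume that $f^* = \operatorname{argmin}_f R(f)$ is the Bayes rule and $\hat f_n$ is the solution of (4.5), then

$$R(\hat f_n) - R(f^*) \le \frac{q+1}{q}\,(\epsilon_A + \epsilon_E), \tag{4.6}$$

where $\epsilon_A$ and $\epsilon_E$ are defined as follows and $V_q$ is the generalized DWD loss,

$$\begin{aligned}
\epsilon_A &= \inf_{f \in \mathcal{H}_K} E_{XY}\left[V_q(Y f(X))\right] - E_{XY}\left[V_q(Y \tilde f(X))\right], \\
\epsilon_E &= \epsilon_E(\hat f_n) = E_{XY}\left[V_q(Y \hat f_n(X))\right] - \inf_{f \in \mathcal{H}_K} E_{XY}\left[V_q(Y f(X))\right].
\end{aligned} \tag{4.7}$$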


In the above lemma $R(f^*)$ is the Bayes error rate and $R(\hat f_n)$ is the misclassification error
of the kernel DWD applied to new data points. If $R(\hat f_n) \to R(f^*)$, we say the classifier is
Bayes risk consistent. Based on Lemma 4, it suffices to show that both $\epsilon_A$ and $\epsilon_E$ approach
zero in order to demonstrate the Bayes risk consistency of the kernel DWD. Note that $\epsilon_A$
is deterministic and is called the approximation error. If the RKHS is rich enough then
the approximation error can be made arbitrarily small. In the literature, the notion of a
universal kernel (Steinwart, 2001; Micchelli et al., 2006) has been proposed and studied.
Suppose $\mathcal{X} \subset \mathbb{R}^p$ is the compact input space of $X$ and $C(\mathcal{X})$ is the space of all continuous
functions $g : \mathcal{X} \to \mathbb{R}$. The kernel $K$ is said to be universal if the function space $\mathcal{H}_K$ generated
by $K$ is dense in $C(\mathcal{X})$, that is, for any positive $\epsilon$ and any function $g \in C(\mathcal{X})$, there exists an
$f \in \mathcal{H}_K$ such that $\|f - g\|_\infty < \epsilon$.

Theorem 1. Suppose $\hat f_n$ is the solution of (4.5), $\mathcal{H}_K$ is induced by a universal kernel $K$,
and the sample space $\mathcal{X}$ is compact. Then we have

(1) $\epsilon_A = 0$;

(2) Let $B = \sup_x K(x, x) < \infty$. When $\lambda_n \to 0$ and $n\lambda_n \to \infty$, for any $\epsilon > 0$,

$$\lim_{n \to \infty} P\left(\epsilon_E(\hat f_n) > \epsilon\right) = 0.$$

By (1), (2) and (4.6) we have $R(\hat f_n) \to R(f^*)$ in probability.

The Gaussian kernel is universal and $B \le 1$. Thus Theorem 1 says that the kernel DWD
using the Gaussian kernel is Bayes risk consistent. This offers a theoretical explanation for
the numerical results in Figure 1.
5 Real Data Analysis


In this section, we investigate the performance of kerndwd on four benchmark data sets: the 
BUPA liver disorder data, the Haberman’s survival data, the Connectionist Bench (sonar, 
mines vs. rocks) data, and the vertebral column data. All the data sets were obtained from 
UCI Machine Learning Repository (Lichman, 2013). 

For comparison purposes, we considered the SVM, the standard DWD ($q = 1$) and the
generalized DWD models with $q = 0.5, 4, 8$. We computed all DWD models using our R
package kerndwd and solved the SVM using the R package kernlab (Karatzoglou et al.,
2004). We randomly split each data set into a training set and a test set with a ratio of 2:1. For each
method using the linear kernel, we conducted five-fold cross-validation on the training
set to tune $\lambda$. For each method using Gaussian kernels, the pair $(\sigma, \lambda)$ was tuned by
five-fold cross-validation. We then fitted each model with the selected $\lambda$ and evaluated its
prediction accuracy on the test set.
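The evaluation protocol above (a 2:1 random split followed by five-fold cross-validation on the training part) can be sketched as follows. This is a generic illustration, not the code used for the experiments: `fit` and `error` are hypothetical placeholders standing in for a real model fit (for example a call into kerndwd or kernlab) and its validation error, and the toy data and candidate grid are invented for the example.

```python
import random


def train_test_split(n, train_fraction=2.0 / 3.0, seed=0):
    """Randomly split the indices 0..n-1 into training and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]


def k_fold(indices, k=5):
    """Return k (training, validation) index pairs covering `indices`."""
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for h in range(k):
        train = [j for m, fold in enumerate(folds) if m != h for j in fold]
        pairs.append((train, folds[h]))
    return pairs


def cv_select(candidates, indices, fit, error, k=5):
    """Pick the tuning value with the lowest mean validation error."""
    best, best_err = None, float("inf")
    for c in candidates:
        errs = [error(fit(tr, c), va) for tr, va in k_fold(indices, k)]
        mean_err = sum(errs) / len(errs)
        if mean_err < best_err:
            best, best_err = c, mean_err
    return best, best_err


# Toy data: the label is +1 when x exceeds 0.3 (purely illustrative).
n = 345
xs = [(i % 21) / 10.0 - 1.0 for i in range(n)]
ys = [1 if v > 0.3 else -1 for v in xs]


def fit(train_idx, c):
    """Placeholder 'model': a threshold rule whose cut-off is the tuning value."""
    return lambda v: 1 if v > c else -1


def error(model, val_idx):
    """Misclassification rate of `model` on the validation indices."""
    return sum(model(xs[i]) != ys[i] for i in val_idx) / len(val_idx)


train_idx, test_idx = train_test_split(n)
best_c, best_cv_err = cv_select((-0.5, 0.0, 0.3, 1.0), train_idx, fit, error)
test_err = error(fit(train_idx, best_c), test_idx)
```

For the Gaussian kernel the same routine applies with `candidates` ranging over pairs $(\sigma, \lambda)$ instead of single values.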

Table 2 displays the average timing and mis-classification rates. We do not argue that
either the SVM or DWD outperforms the other; rather, the two models are highly comparable.
SVM models work better on the sonar and vertebral data, and DWD performs better on the bupa
and haberman data. For three out of the four data sets, the best method uses a Gaussian
kernel, indicating that linear classifiers may not be adequate in such cases. In terms of
timing, kerndwd runs faster than kernlab in all these examples. It is also interesting to
see that DWD with $q = 0.5$ can work slightly better than DWD with $q = 1$ on the bupa and
haberman data, although the difference is not significant.

6 Discussion 

In this paper we have developed a new algorithm for solving the linear generalized DWD
and the kernel generalized DWD. Compared with the current state-of-the-art algorithm for
solving the linear DWD, our new algorithm is easier to understand, more general, and much
more efficient. DWD equipped with the new algorithm can be computationally more efficient
than the SVM. We have established the statistical learning theory of the kernel generalized
DWD, showing that the kernel DWD and the kernel SVM are comparable in theory. Our
theoretical analysis and algorithm do not suggest that DWD with $q = 1$ has any special merit
compared to the other members of the generalized DWD family. Numerical examples further
support our theoretical conclusions. DWD with $q = 1$ is called the standard DWD purely
due to the fact that it, unlike the other generalized DWDs, could be solved by SOCP when the DWD
idea was first proposed. Now with our new algorithm and theory, practitioners have the
option to explore different DWD classifiers.

Table 2. The mis-classification rates and timings (in seconds) for four benchmark data sets. Each
data set was split into a training and a test set. On the training set, the tuning parameters were
selected by five-fold cross-validation and the models were fitted accordingly. The mis-classification
rates were assessed on the test sets. All the timings include tuning parameters. For each data set,
the lowest mis-classification rate (ties included) is marked with an asterisk.

                        Bupa                  Haberman              Sonar                 Vertebral
                        n = 345, p = 6        n = 305, p = 3        n = 208, p = 60       n = 310, p = 6
Kernel    Method        error (%)      time   error (%)      time   error (%)      time   error (%)      time
Linear    SVM           31.63 (0.50)   17.47  26.97 (0.53)   11.74  25.97 (0.66)   8.01   14.83* (0.42)  8.07
          DWD q = 1     34.82 (0.75)   0.05   26.71 (0.54)   0.03   25.65 (0.75)   0.30   16.76 (0.53)   0.07
          DWD q = 0.5   34.23 (0.72)   0.06   26.73 (0.53)   0.04   25.10 (0.72)   0.35   16.54 (0.51)   0.10
          DWD q = 4     35.08 (0.71)   0.05   26.69 (0.55)   0.03   26.00 (0.76)   0.32   16.54 (0.53)   0.06
          DWD q = 8     35.08 (0.76)   0.06   26.53 (0.56)   0.03   25.97 (0.71)   0.34   17.01 (0.53)   0.06
Gaussian  SVM           32.23 (0.48)   6.57   27.92 (0.61)   6.00   15.65* (0.56)  8.96   16.50 (0.46)   6.07
          DWD q = 1     32.14 (0.63)   2.83   26.46 (0.57)   2.03   20.67 (0.76)   0.83   17.57 (0.49)   2.23
          DWD q = 0.5   31.62* (0.61)  2.80   26.42* (0.58)  2.06   21.42 (0.79)   0.84   17.59 (0.56)   2.27
          DWD q = 4     31.63 (0.61)   3.05   26.42* (0.57)  2.08   20.26 (0.76)   0.91   17.15 (0.50)   2.28
          DWD q = 8     32.07 (0.57)   3.28   26.53 (0.56)   2.21   20.00 (0.67)   0.98   16.93 (0.50)   2.39

In the present paper we have considered the standard classification problem under the 0-1
loss. In many applications we may face so-called non-standard classification problems.
For example, observed data may be collected via biased sampling and/or we need to consider
unequal costs for different types of mis-classification. Qiao et al. (2010) introduced a weighted
DWD to handle the non-standard classification problem, which follows the treatment of the
non-standard SVM in Lin et al. (2002). Qiao et al. (2010) defined the weighted DWD as
follows,

$$\min_{\beta_0,\beta} \sum_{i=1}^{n} w(y_i)\left(\frac{1}{r_i} + c\,\xi_i\right), \quad \text{subject to } r_i = y_i(\beta_0 + x_i^T\beta) + \xi_i \ge 0 \text{ and } \beta^T\beta = 1, \tag{6.1}$$

which can be further generalized to the weighted kernel DWD:

$$\min_{\beta_0,\alpha} C_K^w(\beta_0,\alpha) = \min_{\beta_0,\alpha} \left[ \frac{1}{n}\sum_{i=1}^{n} w(y_i)\, V_q\left(y_i(\beta_0 + K_i^T\alpha)\right) + \lambda\,\alpha^T K \alpha \right]. \tag{6.2}$$
Qiao et al. (2010) gave the expressions for $w(y_i)$ for various non-standard classification
problems. Qiao et al. (2010) solved the weighted DWD with $q = 1$ (6.1) based on
second-order-cone programming. The MM procedure for Algorithm 1 and Algorithm 2 can
easily accommodate the weight factors $w(y_i)$ to solve the weighted DWD and the weighted
kernel DWD. We have implemented the weighted DWD in the R package kerndwd.


Appendix: technical proofs 


Proof of Lemma 1 

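Write $v_i = y_i(\omega_0 + x_i^T\omega)$ and $G(\eta_i) = 1/(v_i + \eta_i)^q + c\,\eta_i$. The objective function of (2.10)
can be written as $\sum_{i=1}^{n} G(\eta_i)$. We next minimize (2.10) over $\eta_i$ for every fixed $i$ by computing
the first-order and the second-order derivatives of $G(\eta_i)$:

$$G'(\eta_i) = -\frac{q}{(v_i + \eta_i)^{q+1}} + c, \qquad G''(\eta_i) = \frac{q(q+1)}{(v_i + \eta_i)^{q+2}} > 0.$$

If $v_i > (q/c)^{\frac{1}{q+1}}$ then $G'(\eta_i) > 0$ for all $\eta_i \ge 0$, and $\eta_i^* = 0$ is the minimizer. If $v_i \le (q/c)^{\frac{1}{q+1}}$,
then $\eta_i^* = (q/c)^{\frac{1}{q+1}} - v_i$ is the minimizer as $G'(\eta_i^*) = 0$ and $G''(\eta_i^*) > 0$.

By plugging the minimizer $\eta_i^*$ into $G(\eta_i)$, we obtain

$$\min_{\omega_0,\omega} \sum_{i=1}^{n} \tilde V_q\left(y_i(\omega_0 + x_i^T\omega)\right), \quad \text{subject to } \omega^T\omega = 1, \tag{6.3}$$

where

$$\tilde V_q(v) = \begin{cases} (q+1)\left(\dfrac{c}{q}\right)^{\frac{q}{q+1}} - c\,v, & \text{if } v \le \left(\dfrac{q}{c}\right)^{\frac{1}{q+1}}, \\[2ex] \dfrac{1}{v^q}, & \text{if } v > \left(\dfrac{q}{c}\right)^{\frac{1}{q+1}}. \end{cases}$$

We now simplify (6.3). Suppose $t = \left(\frac{q}{q+1}\right)\left(\frac{c}{q}\right)^{\frac{1}{q+1}}$ and $t_1 = \left(\frac{1}{q+1}\right)\left(\frac{q}{c}\right)^{\frac{q}{q+1}}$. We define $V_q(u) = t_1 \cdot \tilde V_q(u/t)$ for each $q$,

$$V_q(u) = \begin{cases} 1 - u, & \text{if } u \le \dfrac{q}{q+1}, \\[2ex] \dfrac{1}{u^q}\,\dfrac{q^q}{(q+1)^{q+1}}, & \text{if } u > \dfrac{q}{q+1}. \end{cases}$$

By setting $\beta_0 = t \cdot \omega_0$ and $\beta = t \cdot \omega$, we find that (6.3) becomes

$$\min_{\beta_0,\beta} \sum_{i=1}^{n} V_q\left(y_i(\beta_0 + x_i^T\beta)\right), \quad \text{subject to } \beta^T\beta = t^2,$$

which can be further transformed to (3.1) with $\lambda$ and $t$ in one-to-one correspondence.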

Proof of Lemma 2 

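We first prove (3.5). We observe that

$$0 < V_q''(u) = \frac{q^{q+1}}{(q+1)^{q}}\,\frac{1}{u^{q+2}} < \frac{(q+1)^2}{q}, \quad \text{for any } u > \frac{q}{q+1}.$$

Also $V_q'(u)$ is continuous on $[\frac{q}{q+1}, \infty)$ and differentiable on $(\frac{q}{q+1}, \infty)$.

If both $u_1$ and $u_2 > \frac{q}{q+1}$, then the mean value theorem implies that there exists $u^{**} > \frac{q}{q+1}$ such that

$$\frac{|V_q'(u_1) - V_q'(u_2)|}{|u_1 - u_2|} = V_q''(u^{**}) < \frac{(q+1)^2}{q}. \tag{6.4}$$

If $u_1 > \frac{q}{q+1}$ and $u_2 \le \frac{q}{q+1}$, then $V_q'(u_2) = V_q'\left(\frac{q}{q+1}\right) = -1$. The mean value theorem
implies that there exists $u^{**} > \frac{q}{q+1}$ satisfying

$$\frac{|V_q'(u_1) - V_q'(u_2)|}{|u_1 - u_2|} \le \frac{\left|V_q'(u_1) - V_q'\left(\frac{q}{q+1}\right)\right|}{\left|u_1 - \frac{q}{q+1}\right|} = V_q''(u^{**}) < \frac{(q+1)^2}{q}. \tag{6.5}$$

If both $u_1$ and $u_2 \le \frac{q}{q+1}$, then $V_q'(u_1) = V_q'(u_2) = -1$. It is trivial that

$$\frac{|V_q'(u_1) - V_q'(u_2)|}{|u_1 - u_2|} = 0 < \frac{(q+1)^2}{q}. \tag{6.6}$$

By (6.4), (6.5), and (6.6), we prove (3.5).

We now prove (3.6). Let $\nu(a) = \frac{(q+1)^2}{2q}\,a^2 - V_q(a)$. From (3.5), it is not hard to show
that $\nu'(a) = \frac{(q+1)^2}{q}\,a - V_q'(a)$ is strictly increasing. Therefore $\nu(a)$ is a strictly convex
function, and its first-order condition, $\nu(t) > \nu(\tilde t) + \nu'(\tilde t)(t - \tilde t)$, verifies (3.6) directly.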
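The two conclusions of this proof, the Lipschitz bound on $V_q'$ and the resulting quadratic majorization, can be checked numerically on a grid. The sketch below assumes $V_q(u) = 1 - u$ for $u \le q/(q+1)$ and $q^q/((q+1)^{q+1}u^q)$ otherwise and uses the constant $(q+1)^2/q$; the grid and the values of $q$ are arbitrary illustrative choices.

```python
def dwd_loss(u, q):
    """Generalized DWD loss V_q(u)."""
    cut = q / (q + 1.0)
    return 1.0 - u if u <= cut else (cut / u) ** q / (q + 1.0)


def dwd_grad(u, q):
    """Derivative V_q'(u)."""
    cut = q / (q + 1.0)
    return -1.0 if u <= cut else -((cut / u) ** (q + 1.0))


def majorizer(t, s, q):
    """Quadratic upper bound of V_q(t) built at the expansion point s."""
    big_m = (q + 1.0) ** 2 / q
    return dwd_loss(s, q) + dwd_grad(s, q) * (t - s) + 0.5 * big_m * (t - s) ** 2


grid = [i / 20.0 for i in range(-60, 101)]  # from -3 to 5 in steps of 0.05
qs = (0.5, 1.0, 4.0, 8.0)

# Smallest value of majorizer minus loss; it should never be negative.
worst_gap = min(majorizer(t, s, q) - dwd_loss(t, q)
                for q in qs for s in grid for t in grid)

# Largest slope of V_q' between grid points, relative to (q+1)^2 / q.
worst_ratio = max(
    abs(dwd_grad(a, q) - dwd_grad(b, q)) / abs(a - b) / ((q + 1.0) ** 2 / q)
    for q in qs for a in grid for b in grid if a != b
)
print(f"min(majorizer - loss) = {worst_gap:.3e}, max slope ratio = {worst_ratio:.4f}")
```

The gap is zero only when $t$ equals the expansion point, which is what makes the quadratic a valid majorizer for the MM updates in Algorithms 1 and 2.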

Proof of Lemma 3 
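Given that $\eta(x) = P(Y = 1 \mid X = x)$, we have that $E_{XY}[V_q(Y f(X))] = E_X\,\zeta(f(X))$, where

$$\begin{aligned}
\zeta(f(x)) &= \eta(x) V_q(f(x)) + [1 - \eta(x)] V_q(-f(x)) \\
&= \begin{cases}
\eta(x)\,\dfrac{1}{f(x)^q}\,\dfrac{q^q}{(q+1)^{q+1}} + [1 - \eta(x)][1 + f(x)], & \text{if } f(x) > \dfrac{q}{q+1}, \\[2ex]
\eta(x)[1 - f(x)] + [1 - \eta(x)][1 + f(x)], & \text{if } -\dfrac{q}{q+1} \le f(x) \le \dfrac{q}{q+1}, \\[2ex]
\eta(x)[1 - f(x)] + [1 - \eta(x)]\,\dfrac{1}{(-f(x))^q}\,\dfrac{q^q}{(q+1)^{q+1}}, & \text{if } f(x) < -\dfrac{q}{q+1}.
\end{cases}
\end{aligned}$$

For each given $x$, we take both $f(x)$ and $\eta(x)$ as scalars and write them as $f$ and
$\eta$ respectively. We then take $\zeta(f) = \zeta(f(x))$ as a function of $f$ and compute the derivative
with respect to $f$:

$$\frac{d\zeta(f)}{df} = \begin{cases}
-\eta\,\dfrac{1}{f^{q+1}}\,\dfrac{q^{q+1}}{(q+1)^{q+1}} + 1 - \eta, & \text{if } f > \dfrac{q}{q+1}, \\[2ex]
1 - 2\eta, & \text{if } -\dfrac{q}{q+1} < f < \dfrac{q}{q+1}, \\[2ex]
-\eta + (1 - \eta)\,\dfrac{1}{(-f)^{q+1}}\,\dfrac{q^{q+1}}{(q+1)^{q+1}}, & \text{if } f < -\dfrac{q}{q+1}.
\end{cases}$$

We see that (1) when $\eta > 0.5$, $d\zeta(f)/df = 0$ only when $f = \tilde f = \frac{q}{q+1}\left(\frac{\eta}{1-\eta}\right)^{\frac{1}{q+1}} > \frac{q}{q+1}$, and (2)
when $\eta < 0.5$, $d\zeta(f)/df = 0$ only when $f = \tilde f = -\frac{q}{q+1}\left(\frac{1-\eta}{\eta}\right)^{\frac{1}{q+1}} < -\frac{q}{q+1}$. In these two cases, we
also observe that

$$\begin{cases} d\zeta(f)/df < 0, & \text{if } f < \tilde f, \\ d\zeta(f)/df > 0, & \text{if } f > \tilde f, \end{cases}$$

from which it follows that $\tilde f$ is the minimizer of $\zeta(f)$.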

Proof of Lemma 4 

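As $\tilde f(x)$ was defined in (4.4), we see that for each $x$,

$$\begin{aligned}
\zeta(\tilde f(x)) &= \eta(x) V_q(\tilde f(x)) + [1 - \eta(x)] V_q(-\tilde f(x)) \\
&= \begin{cases}
\eta(x) + [1 - \eta(x)]^{\frac{1}{q+1}}\,\eta(x)^{\frac{q}{q+1}}, & \text{if } \eta(x) < 1/2, \\[1ex]
1 - \eta(x) + \eta(x)^{\frac{1}{q+1}}\,[1 - \eta(x)]^{\frac{q}{q+1}}, & \text{if } \eta(x) > 1/2,
\end{cases} \\
&= \frac{1}{2}\left(1 - |2\eta(x) - 1|\right) + \frac{1}{2}\left(1 + |2\eta(x) - 1|\right)^{\frac{1}{q+1}}\left(1 - |2\eta(x) - 1|\right)^{\frac{q}{q+1}}.
\end{aligned} \tag{6.7}$$

For $a \in [0, 1]$, we define $\gamma(a)$ and compute its first-order derivative as follows,

$$\begin{aligned}
\gamma(a) &= 1 - \frac{1}{2}(1 - a) - \frac{1}{2}(1 + a)^{\frac{1}{q+1}}(1 - a)^{\frac{q}{q+1}} - \frac{q}{q+1}\,a, \\
\gamma'(a) &= \frac{1}{2} - \frac{1}{2(q+1)}\left(\frac{1-a}{1+a}\right)^{\frac{q}{q+1}} + \frac{q}{2(q+1)}\left(\frac{1+a}{1-a}\right)^{\frac{1}{q+1}} - \frac{q}{q+1} \\
&= \frac{1}{2(q+1)}\left[1 - \left(\frac{1-a}{1+a}\right)^{\frac{q}{q+1}}\right] + \frac{q}{2(q+1)}\left[\left(\frac{1+a}{1-a}\right)^{\frac{1}{q+1}} - 1\right] \ge 0.
\end{aligned}$$

Hence for each $a \in [0, 1]$, $\gamma(a) \ge \gamma(0) = 0$. For each $x$, let $a = |2\eta(x) - 1|$ and we see that

$$1 - \zeta(\tilde f(x)) \ge \frac{q}{q+1}\,|2\eta(x) - 1|.$$

By $R(f) = E_{XY}[I(Y \ne \mathrm{sign}(f(X)))] = E_{\{X: f(X) > 0\}}[1 - \eta(X)] + E_{\{X: f(X) < 0\}}\,\eta(X)$, we obtain

$$\begin{aligned}
R(\hat f_n) - R(f^*) &= E_{\{X: \hat f_n(X) > 0,\, f^*(X) < 0\}}[1 - 2\eta(X)] + E_{\{X: \hat f_n(X) < 0,\, f^*(X) > 0\}}[2\eta(X) - 1] \\
&= E_{\{X: \hat f_n(X) f^*(X) < 0\}}\,|2\eta(X) - 1| \\
&\le \frac{q+1}{q}\,E_{\{X: \hat f_n(X) f^*(X) < 0\}}\left[1 - \zeta(\tilde f(X))\right].
\end{aligned} \tag{6.8}$$

Since $f^*(X)$ and $\tilde f(X)$ share the same sign, $\hat f_n(X) f^*(X) < 0$ implies that $\hat f_n(X)\tilde f(X) < 0$.
When $\hat f_n(X)\tilde f(X) < 0$, 0 is between $\hat f_n(X)$ and $\tilde f(X)$, and thus (6.7) indicates that
$\zeta(\tilde f(X)) \le \zeta(0) = 1 \le \zeta(\hat f_n(X))$. From (6.8), we conclude that

$$\begin{aligned}
R(\hat f_n) - R(f^*) &\le \frac{q+1}{q}\,E_{\{X: \hat f_n(X) f^*(X) < 0\}}\left[\zeta(\hat f_n(X)) - \zeta(\tilde f(X))\right] \\
&\le \frac{q+1}{q}\,E_X\left[\zeta(\hat f_n(X)) - \zeta(\tilde f(X))\right] \\
&= \frac{q+1}{q}\,E_{XY}\left[V_q(Y\hat f_n(X)) - V_q(Y\tilde f(X))\right] \\
&= \frac{q+1}{q}\,(\epsilon_A + \epsilon_E).
\end{aligned}$$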
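The key inequality in the proof above, $1 - \zeta(\tilde f(x)) \ge \frac{q}{q+1}|2\eta(x) - 1|$, lends itself to a quick numerical check. The sketch below evaluates the minimal conditional risk in two ways (directly from the loss, and through the closed form in $|2\eta - 1|$) and confirms the bound on a grid. It assumes $V_q(u) = 1 - u$ for $u \le q/(q+1)$ and $q^q/((q+1)^{q+1}u^q)$ otherwise; all function names and grid choices are illustrative.

```python
def dwd_loss(u, q):
    """Generalized DWD loss V_q(u)."""
    cut = q / (q + 1.0)
    return 1.0 - u if u <= cut else (cut / u) ** q / (q + 1.0)


def population_minimizer(eta, q):
    """Minimizer of the conditional risk for eta in (0, 1), eta != 1/2."""
    cut = q / (q + 1.0)
    if eta > 0.5:
        return cut * (eta / (1.0 - eta)) ** (1.0 / (q + 1.0))
    return -cut * ((1.0 - eta) / eta) ** (1.0 / (q + 1.0))


def min_risk_direct(eta, q):
    """Conditional risk evaluated at the population minimizer."""
    f = population_minimizer(eta, q)
    return eta * dwd_loss(f, q) + (1.0 - eta) * dwd_loss(-f, q)


def min_risk_closed_form(eta, q):
    """The same quantity written in terms of a = |2 eta - 1|."""
    a = abs(2.0 * eta - 1.0)
    return 0.5 * (1.0 - a) + 0.5 * (1.0 + a) ** (1.0 / (q + 1.0)) * (1.0 - a) ** (q / (q + 1.0))


def gamma(a, q):
    """Slack in the inequality 1 - min risk >= q/(q+1) * a."""
    return (1.0 - 0.5 * (1.0 - a)
            - 0.5 * (1.0 + a) ** (1.0 / (q + 1.0)) * (1.0 - a) ** (q / (q + 1.0))
            - q / (q + 1.0) * a)


qs = (0.5, 1.0, 4.0, 8.0)
etas = [k / 100.0 for k in range(1, 100) if k != 50]
max_mismatch = max(abs(min_risk_direct(e, q) - min_risk_closed_form(e, q))
                   for q in qs for e in etas)
min_slack = min(gamma(k / 1000.0, q) for q in qs for k in range(0, 1001))
print(f"max mismatch = {max_mismatch:.2e}, min slack = {min_slack:.2e}")
```

The slack vanishes at $a = 0$ and equals $1/(q+1)$ at $a = 1$, so the inequality is tight only where $\eta(x) = 1/2$.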


Proof of Theorem 1 

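Part (1). We first show that when $\mathcal{H}_K$ is induced by a universal kernel, the approximation
error $\epsilon_A = 0$. By definition, we need to show that for any $\epsilon > 0$, there exists $f_\epsilon \in \mathcal{H}_K$ such
that

$$E_{XY} V_q(Y f_\epsilon(X)) - E_{XY} V_q(Y \tilde f(X)) < \epsilon. \tag{6.9}$$

We first use truncation to consider a truncated version of $\tilde f$. For any given $\delta \in (0, 0.5)$,
we define

$$f_\delta(X) = \begin{cases}
\dfrac{q}{q+1}\left(\dfrac{1-\delta}{\delta}\right)^{\frac{1}{q+1}}, & \text{if } \eta(X) > 1 - \delta, \\[2ex]
\tilde f(X), & \text{if } \delta \le \eta(X) \le 1 - \delta, \\[1ex]
-\dfrac{q}{q+1}\left(\dfrac{1-\delta}{\delta}\right)^{\frac{1}{q+1}}, & \text{if } \eta(X) < \delta.
\end{cases}$$

We have that

$$0 \le E_{XY} V_q(Y f_\delta(X)) - E_{XY} V_q(Y \tilde f(X)) = \kappa_+ + \kappa_-,$$

where

$$\begin{aligned}
\kappa_+ ={}& E_{X: \eta(X) > 1-\delta}\left[\eta(X) V_q(f_\delta(X)) + (1 - \eta(X)) V_q(-f_\delta(X))\right] \\
&- E_{X: \eta(X) > 1-\delta}\left[\eta(X) V_q(\tilde f(X)) + (1 - \eta(X)) V_q(-\tilde f(X))\right], \\
\kappa_- ={}& E_{X: \eta(X) < \delta}\left[\eta(X) V_q(f_\delta(X)) + (1 - \eta(X)) V_q(-f_\delta(X))\right] \\
&- E_{X: \eta(X) < \delta}\left[\eta(X) V_q(\tilde f(X)) + (1 - \eta(X)) V_q(-\tilde f(X))\right].
\end{aligned}$$

Since $V_q(f_\delta(X)) < V_q(-f_\delta(X))$ when $\eta(X) > 1 - \delta$,

$$\begin{aligned}
\kappa_+ \le{}& E_{X: \eta(X) > 1-\delta}\left[(1 - \delta) V_q(f_\delta(X)) + \delta V_q(-f_\delta(X))\right] \\
&- E_{X: \eta(X) > 1-\delta}\left[\eta(X) V_q(\tilde f(X)) + (1 - \eta(X)) V_q(-\tilde f(X))\right] \\
={}& E_{X: \eta(X) > 1-\delta}\left[\delta + (1 - \delta)^{\frac{1}{q+1}}\,\delta^{\frac{q}{q+1}}\right] - E_{X: \eta(X) > 1-\delta}\left[1 - \eta(X) + \eta(X)^{\frac{1}{q+1}}(1 - \eta(X))^{\frac{q}{q+1}}\right].
\end{aligned}$$

We notice that $(1 - a) + a^{\frac{1}{q+1}}(1 - a)^{\frac{q}{q+1}}$ is a continuous function in terms of $a \in (0, 1)$. Since
$\eta(X) > 1 - \delta$ implies that $|\eta(X) - (1 - \delta)| < \delta$, we conclude that for any given $\epsilon > 0$, there
exists a sufficiently small $\delta$ such that $\kappa_+ < \epsilon/6$. We can also obtain $\kappa_- < \epsilon/6$ in the same
spirit. Therefore,

$$0 \le E_{XY} V_q(Y f_\delta(X)) - E_{XY} V_q(Y \tilde f(X)) \le \kappa_+ + \kappa_- < \epsilon/3. \tag{6.10}$$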
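By Lusin's Theorem, there exists a continuous function $\varrho(X)$ such that $P(\varrho(X) \ne f_\delta(X)) \le \epsilon(q+1)/(6q)$. Notice that $\sup_X |f_\delta(X)| \le q/(q+1)$. Define

$$\tau(X) = \begin{cases}
\varrho(X), & \text{if } |\varrho(X)| \le \dfrac{q}{q+1}, \\[2ex]
\dfrac{q}{q+1}\,\dfrac{\varrho(X)}{|\varrho(X)|}, & \text{if } |\varrho(X)| > \dfrac{q}{q+1},
\end{cases}$$

then $P(\tau(X) \ne f_\delta(X)) \le \epsilon(q+1)/(6q)$ as well. Hence

$$\left|E_{XY} V_q(Y f_\delta(X)) - E_{XY} V_q(Y \tau(X))\right| \le E_X\left|f_\delta(X) - \tau(X)\right| = E_{\{X: \tau(X) \ne f_\delta(X)\}}\left|f_\delta(X) - \tau(X)\right| \le \frac{2q}{q+1} \cdot \frac{\epsilon(q+1)}{6q} = \frac{\epsilon}{3},$$

where the first inequality comes from the fact that $V_q(u)$ is Lipschitz continuous, i.e.,

$$|V_q(u_1) - V_q(u_2)| \le |u_1 - u_2|, \quad \forall u_1, u_2 \in \mathbb{R}.$$

Notice that $\tau(X)$ is also continuous. The definition of the universal kernel implies the
existence of a function $f_\epsilon \in \mathcal{H}_K$ such that

$$\left|E_{XY} V_q(Y f_\epsilon(X)) - E_{XY} V_q(Y \tau(X))\right| \le \sup_X |f_\epsilon(X) - \tau(X)| < \epsilon/3. \tag{6.11}$$

By combining (6.10), the bound between $f_\delta$ and $\tau$ displayed above, and (6.11), we obtain (6.9).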

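Part (2). In this part we bound the estimation error $\epsilon_E(\hat f_n)$. Note that the RKHS has the
following reproducing property (Wahba, 1990; Hastie et al., 2009):

$$\begin{aligned}
\langle K(x_i, x), f(x) \rangle_{\mathcal{H}_K} &= f(x_i), \\
\langle K(x_i, x), K(x_j, x) \rangle_{\mathcal{H}_K} &= K(x_i, x_j).
\end{aligned} \tag{6.12}$$

Fix any $\epsilon > 0$. By the KKT condition of (4.5) and the representer theorem, we have

$$\frac{1}{n}\sum_{i=1}^{n} V_q'\left(y_i \hat f_n(x_i)\right) y_i K(x_i, x) + 2\lambda_n \hat f_n(x) = 0. \tag{6.13}$$

We define $\hat f^{[k]}$ as the solution of (4.5) when the $k$th observation is excluded from the training
data, i.e.,

$$\hat f^{[k]} = \operatorname*{argmin}_{f \in \mathcal{H}_K} \left[ \frac{1}{n}\sum_{i=1, i \ne k}^{n} V_q\left(y_i f(x_i)\right) + \lambda_n \|f\|_{\mathcal{H}_K}^2 \right]. \tag{6.14}$$

By the definition of $\hat f^{[k]}$ and the convexity of $V_q$, we have

$$\begin{aligned}
0 &\le \frac{1}{n}\sum_{i=1, i \ne k}^{n} V_q\left(y_i \hat f_n(x_i)\right) + \lambda_n \|\hat f_n\|_{\mathcal{H}_K}^2 - \frac{1}{n}\sum_{i=1, i \ne k}^{n} V_q\left(y_i \hat f^{[k]}(x_i)\right) - \lambda_n \|\hat f^{[k]}\|_{\mathcal{H}_K}^2 \\
&\le -\frac{1}{n}\sum_{i=1, i \ne k}^{n} V_q'\left(y_i \hat f_n(x_i)\right) y_i \left(\hat f^{[k]}(x_i) - \hat f_n(x_i)\right) + \lambda_n \|\hat f_n\|_{\mathcal{H}_K}^2 - \lambda_n \|\hat f^{[k]}\|_{\mathcal{H}_K}^2.
\end{aligned}$$

By the reproducing property, we further have

$$\begin{aligned}
0 &\le -\frac{1}{n}\sum_{i=1, i \ne k}^{n} V_q'\left(y_i \hat f_n(x_i)\right) y_i \left\langle K(x_i, x), \hat f^{[k]}(x) - \hat f_n(x) \right\rangle_{\mathcal{H}_K} + \lambda_n \|\hat f_n\|_{\mathcal{H}_K}^2 - \lambda_n \|\hat f^{[k]}\|_{\mathcal{H}_K}^2 \\
&= -\frac{1}{n}\sum_{i=1, i \ne k}^{n} V_q'\left(y_i \hat f_n(x_i)\right) y_i \left\langle K(x_i, x), \hat f^{[k]}(x) - \hat f_n(x) \right\rangle_{\mathcal{H}_K} \\
&\quad - 2\lambda_n \left\langle \hat f_n(x), \hat f^{[k]}(x) - \hat f_n(x) \right\rangle_{\mathcal{H}_K} - \lambda_n \|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K}^2 \\
&= \frac{1}{n} V_q'\left(y_k \hat f_n(x_k)\right) y_k \left\langle K(x_k, x), \hat f^{[k]}(x) - \hat f_n(x) \right\rangle_{\mathcal{H}_K} - \lambda_n \|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K}^2,
\end{aligned}$$

where the equality in the end holds by (6.13). Thus, by the Cauchy-Schwarz inequality,

$$\begin{aligned}
n\lambda_n \|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K}^2 &\le V_q'\left(y_k \hat f_n(x_k)\right) y_k \left\langle K(x_k, x), \hat f^{[k]}(x) - \hat f_n(x) \right\rangle_{\mathcal{H}_K} \\
&\le \left|V_q'\left(y_k \hat f_n(x_k)\right)\right| \, \|K(x_k, x)\|_{\mathcal{H}_K} \, \|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K} \le \sqrt{K(x_k, x_k)} \cdot \|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K},
\end{aligned}$$

which implies

$$\|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K} \le \frac{\sqrt{B}}{n\lambda_n},$$

where $B = \sup_x K(x, x)$. By the reproducing property, we have

$$\left|\hat f^{[k]}(x_k) - \hat f_n(x_k)\right|^2 = \left\langle K(x, x_k), \hat f^{[k]}(x) - \hat f_n(x) \right\rangle_{\mathcal{H}_K}^2 \le K(x_k, x_k)\,\|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K}^2 \le \frac{B^2}{n^2\lambda_n^2}.$$

By the Lipschitz continuity of the DWD loss, we obtain that for each $k = 1, \ldots, n$,

$$V_q\left(y_k \hat f^{[k]}(x_k)\right) - V_q\left(y_k \hat f_n(x_k)\right) \le \left|\hat f^{[k]}(x_k) - \hat f_n(x_k)\right| \le \frac{B}{n\lambda_n},$$

and therefore,

$$\frac{1}{n}\sum_{k=1}^{n} V_q\left(y_k \hat f^{[k]}(x_k)\right) \le \frac{1}{n}\sum_{k=1}^{n} V_q\left(y_k \hat f_n(x_k)\right) + \frac{B}{n\lambda_n}. \tag{6.15}$$

Let $f_\epsilon^* \in \mathcal{H}_K$ be such that

$$E_{XY} V_q(Y f_\epsilon^*(X)) \le \inf_{f \in \mathcal{H}_K} E_{XY} V_q(Y f(X)) + \epsilon/3. \tag{6.16}$$

By the definition of $\hat f_n$, we have

$$\frac{1}{n}\sum_{k=1}^{n} V_q\left(y_k \hat f_n(x_k)\right) + \lambda_n \|\hat f_n\|_{\mathcal{H}_K}^2 \le \frac{1}{n}\sum_{k=1}^{n} V_q\left(y_k f_\epsilon^*(x_k)\right) + \lambda_n \|f_\epsilon^*\|_{\mathcal{H}_K}^2. \tag{6.17}$$

Since each data point in $T_n = \{(x_k, y_k)\}_{k=1}^{n}$ is drawn from the same distribution, we have

$$E_{T_n}\left[\frac{1}{n}\sum_{k=1}^{n} V_q\left(y_k \hat f^{[k]}(x_k)\right)\right] = \frac{1}{n}\sum_{k=1}^{n} E_{T_n} V_q\left(y_k \hat f^{[k]}(x_k)\right) = E_{T_{n-1}} E_{XY} V_q\left(Y \hat f_{n-1}(X)\right). \tag{6.18}$$

By combining (6.15)-(6.18) we have

$$E_{T_{n-1}}\left[E_{XY} V_q\left(Y \hat f_{n-1}(X)\right)\right] \le \inf_{f \in \mathcal{H}_K} E_{XY} V_q(Y f(X)) + \lambda_n \|f_\epsilon^*\|_{\mathcal{H}_K}^2 + \frac{B}{n\lambda_n} + \frac{\epsilon}{3}. \tag{6.19}$$

By the choice of $\lambda_n$, we see that there exists $N_\epsilon$ such that when $n > N_\epsilon$ we have $\lambda_n < \epsilon/(3\|f_\epsilon^*\|_{\mathcal{H}_K}^2)$, $n\lambda_n > 3B/\epsilon$, and hence

$$E_{T_{n-1}}\left[E_{XY} V_q\left(Y \hat f_{n-1}(X)\right)\right] \le \inf_{f \in \mathcal{H}_K} E_{XY} V_q(Y f(X)) + \epsilon.$$

Because $\epsilon$ is arbitrary and $E_{T_{n-1}}[E_{XY} V_q(Y \hat f_{n-1}(X))] \ge \inf_{f \in \mathcal{H}_K} E_{XY} V_q(Y f(X))$, we have
$\lim_{n \to \infty} E_{T_{n-1}}[E_{XY} V_q(Y \hat f_{n-1}(X))] = \inf_{f \in \mathcal{H}_K} E_{XY} V_q(Y f(X))$, which equivalently indicates
that $\lim_{n \to \infty} E_{T_n}\,\epsilon_E(\hat f_n) = 0$. Since $\epsilon_E(\hat f_n) \ge 0$, then by the Markov inequality, we prove part
(2).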